A machine learning approach to predict self-protecting behaviors during the early wave of the COVID-19 pandemic

Using a unique harmonized real‐time data set from the COME-HERE longitudinal survey that covers five European countries (France, Germany, Italy, Spain, and Sweden) and applying a non-parametric machine learning model, this paper identifies the main individual and macro-level predictors of self-protecting behaviors against the coronavirus disease 2019 (COVID-19) during the first wave of the pandemic. Exploiting the interpretability of a Random Forest algorithm via Shapely values, we find that a higher regional incidence of COVID-19 triggers higher levels of self-protective behavior, as does a stricter government policy response. The level of individual knowledge about the pandemic, confidence in institutions, and population density also ranks high among the factors that predict self-protecting behaviors. We also identify a steep socioeconomic gradient with lower levels of self-protecting behaviors being associated with lower income and poor housing conditions. Among socio-demographic factors, gender, marital status, age, and region of residence are the main determinants of self-protective measures.

In line with these model-based predictions, there is evidence suggesting that the most important determinants of self-protecting behaviors are socioeconomic factors. Papageorge and colleagues 7 find that in the US, the costs of self-protective behaviors are unevenly distributed across socio-demographic groups such that economically disadvantaged individuals, who presume low risk of adverse health effects, are less likely to engage in protective behaviors. Ashraf 8 shows a strong negative association between COVID-19 cases and socioeconomic conditions in a panel of 77 countries. Previous results also show that socioeconomic disparities in health behaviors create an imbalance between individual responses and socially optimal outcomes [9][10][11] .
Other factors influencing protective behavior that have been identified in the related literature include average neighborhood income 12 , internet access 13 , and political affiliation [14][15][16] . Other studies emphasize the role of perceived severity of COVID-19 and perceived susceptibility to it [17][18][19] , self-efficacy 20 , demographics 21 , and information source and credibility 22 .
The second strand of the literature that our paper relates to is the growing number of studies that apply a machine learning approach in social science investigations to detect fuzzy relations in complex data sets. Machine learning is a computational method that uses experience to improve performance on a given task. It follows a flexible algorithmic approach to learn the experience from the training data. The main focus of machine learning techniques is to perform prediction (regression) and classification, as well as clustering (grouping) 23 . For instance, recent studies in the social sciences have used machine learning methods to estimate how the effect of a particular intervention differs across characteristics of individuals [24][25][26][27][28] . With regards to the COVID-19 pandemic, a few studies employed machine learning methods to isolate and predict key indicators of compliance with social distancing rules 29 , fear and perceived health 30 , and psychological distress 31 .
Through the combination of these two research strands, our study makes several contributions to the existing literature. First, we use a combination of complex non-parametric machine learning model and state-of-the-art model explanation method to explain factors impacting the adoption of self-protecting behaviors during the COVID-19 pandemic. To the best of our knowledge this is the first attempt in the literature. Second, we demonstrate the advantages and relative gains of a tree-based algorithm over linear regression. Third, we train a highly predictive model with original data with a universe of items specifically constructed to measure behavioral change in response to COVID-19. This allows us to minimize the bias from measurement error, which is a common limitation in related studies that only focus on a few measures. Moreover, we use causal theory to justify the reason for including or omitting variables, and our data offer a large number of features. This enables us to minimize bias in the partial effects of features that may occur when an important variable is omitted in a causal model 32,33 . Fourth, our approach allows for the presence of interaction effects among key features. And finally, we identify key policy relevant individual and social predictors of self-protecting behaviors and document their heterogeneity by country.

Data
Our study primarily relies on the COME-HERE survey as a data source. COME-HERE is a longitudinal survey conducted by a group of researchers at the University of Luxembourg to investigate the psychological and socioeconomic effects of the COVID-19 pandemic and related social distancing measures across Europe. The survey is administered by Qualtrics on nationally representative samples of individuals in France, Germany, Italy, Spain, and Sweden. This study was approved by the Ethics Review Panel of the University of Luxembourg (reference number: ERP 20-026-C COME-HERE). All research was carried out in accordance with relevant guidelines, and informed consent was obtained from all participants. Respondents were asked to complete an online questionnaire that took approximately 20 min. The survey collects information at the individual and household levels. The first wave of the survey took place in late April and early May 2020, and over 8000 people participated. The same pool of individuals was re-contacted nine additional times in June, August, and December 2020, in March, June, and October 2021, in February, June and December 2022 (see Vögele and colleagues 34 for more information).
The present results are based on the first wave of the survey with over 8000 participants who answered a range of questions on their sociodemographic characteristics, socioeconomic variables, pre-existing health conditions and risk behaviors, mental health, resilience, social support, and trust in major institutions. This dataset is representative of the population in the countries in terms of age, gender and region of residence. In the next paragraphs, we outline in greater detail how we define the variables and features used in the prediction exercises.
Our outcome variable is a composite measure of self-protecting behaviors. The measure is constructed from a range of questions from the specially designed Coronavirus Behavior Scale ( CBS ). The CBS is a 14-item selfreport measure of behavior change due to the Coronavirus pandemic. It contains two subscales, with nine items assessing reasonable behaviors (e.g., shaking hands less) and five items assessing unreasonable behaviors (e.g., buying more toilet paper than usual) that was particularly developed for the purpose of this and similar studies. Responses are given on a 5-point Likert scale ranging from 1 strongly disagree to 5 strongly agree. We focus our analysis on the reasonable behavior subscale. This subscale is based on responses to the following questions: "Because of coronavirus, I am planning to or have already (1) cleaned and disinfected surfaces in my home more often, (2) bought medical masks, (3) stayed home when I feel ill, (4) started avoiding crowded spaces, (5) bought disinfectant, (6) started washing my hands more, and (7) shaken hands with people less". We compute a composite index from the responses to the seven items of the reasonable CBS subscale to define self-protecting behaviors of individual i (CBS i = 1 www.nature.com/scientificreports/ Table 1 presents the summary statistics of key characteristics of individuals in the pooled and country level samples. In the pooled sample, 8063 respondents were included, and the sample mainly consisted of workingage individuals (mean age of 47.5). Approximately 55.4% of respondents reported ties to the labor market (44% employed full-time and 11.5% part-time). Twenty percent of the sample are unemployed, and about 23.7% of the participants are retired. Nearly 58% are married and cohabiting, 8.7% are married but living apart, and 33.4% of the respondents are single. Close to 9% of the participants reported that they live in a house without any open-air access (such as garden, terrace, or balcony). Nearly 52% of respondents report at least one pre-existing health condition. In terms of household income, 21% of the respondents are in relative poverty. See Figure

Method
Many features contribute to self-protecting behaviors against COVID-19. Our objective is to investigate this complex relationship with a high level of accuracy. We consider vectors of features capturing socioeconomic status (such as income and home features), pre-existing health conditions, health and behavioral risk factors, socio-demographic factors, severity of the disease and government policy responses, trust in major institutions, and knowledge about the pandemic. Modelling this relationship with a parametric approach requires the researcher to make certain assumptions about the data-generating process, imposing structure on the data, which may eventually lead to functional misspecification. With supervised ML algorithm as used in this study, we do not need to assume any structure about the data-generating process but directly learn a function that maps the input features to the target variable. The learning process is controlled with hyperparameters that regulate some aspect of the learning algorithm. These hyperparameters are commonly optimized via a grid search over the hyperparameter space.
In this study, we build a Random Forests (RF) model and complement it with model-agnostic interpretative tools to ensure interpretability of our results along the lines of explainable ML (ExpML). See Fig. E.1 in Supplementary Appendix E for an overview of overall workflow. That is, we favor Random Forests (RF) over other ML algorithms because they are easily scalable to accommodate a large dataset with higher dimensionality without losing statistical efficiency. They are also shown to be state-of-the-art algorithms in tabular datasets with categorical variables 35 . Random forests can naturally capture complex interactions between features and their predictions are robust to outliers due to repeated sampling. RF is a collection of many de-correlated decision trees. Trees are grown on a bootstrapped sample with a random subset of feature vectors. In the next step, the final predictions are produced as a mean value of predictions from each tree (in a classification problem, the final predictions are yielded based on a majority vote). The steps in a typical RF algorithm are as follows: (i) Draw a bootstrap sample from the training data and randomly select k variables from p variables, where k < < p. (ii) Select the best split among the k variables. The maximum number of k is a hyperparameter. (iii) Split the node into two daughter nodes. (iv) Repeat step i to iii until the terminal nodes have been reached-until no farther splits are possible.
(v) Repeat the above steps for T number of times to grow a forest of T trees. In this study, all predictions are done with a popular library for the Python programming language called scikit-learn (version 0.22.2). Scikit-learn is an open-source project containing several state-of-the-art ML algorithms and their extensive documentation 36 .
We build a RF model consisting of 500 trees. To control over-fitting, we apply a grid search with fivefold cross-validation and identify the optimal hyperparameters. We optimize two important hyperparameters that will ensure randomness in RF: the maximum depths of trees ('max depth') and maximum number of features used in each split ('max features'). The deeper the tree, the more splits it has, and it captures more information about the data. Similarly, by choosing a reduced number of features we can increase the stability of the tree and reduce variance and over-fitting.
The dataset is randomly divided into training (80%) and testing (20%) sets. For each combination of hyperparameters, we undertake the following three steps: (i) randomly partition the training set into five subsamples, (ii) train the models on the four subsamples and generate predictions for the fifth, and (iii) repeat this process five times so that each subsample is used only once to generate the prediction. The optimal hyperparameters are updated based on the average of these five prediction results. The tuning step ends when we find an optimal hyperparameter that produces minimum average prediction errors. For instance, the selected best RF model in the pooled dataset has a cross-validation test score of 0.7 Root Mean Squared Errors (RMSE) (The explored hyperparameters space, the cross-validation results, and the final hyperparameter combination used in the final model are available on request from the authors).
We assess the predictive accuracy of each Random Forests and a baseline OLS model because having a reliable estimate of predictive performance (e.g., a significant relative performance gain from using RF vis-a-vis the baseline model employing OLS) is an essential requirement for the socioeconomic interpretation of the results of this study. To quantify the extent to which the predicted value for a given respondent is close to the actual value of that individual, we use the most common metrics in regression settings: mean absolute errors (MAE) and root mean squared errors (RMSE). We use MAE to show how the model fares when prediction errors are linearly weighted. The best model will be the one that has a smaller value in both MAE and RMSE (ideally close to zero), meaning that it produces predictions that are close to the true responses. In our data, both metrics produce similar conclusions.
We find that the RF model outperforms OLS in all prediction tasks in both the pooled and per-country datasets (see Table D.1 in the Supplementary Appendix for detailed results). The superior prediction performance of RF shows that we could capture signals relevant to self-protecting behavior, which the linear model failed to capture (as measured by the percentage reduction of the prediction error).
Finally, we use the visualization tool SHapley Additive exPlanations (SHAP) proposed by Lundberg and Lee 37 to explain the contribution of each feature to the prediction of self-protecting behaviors using Shapley values. SHAP is based on a solution concept in a cooperative game setup that aims to 'fairly' allocate the gains among players as suggested in the seminal work of 38 . SHAP has the advantage of consistency and provides both local and global interpretability 36,39,40 .

Results
Identifying the top 30 predictors. We identify the top 30 features in predicting self-protecting behaviors. www.nature.com/scientificreports/ in descending order of average contribution to the prediction (i.e., the global contribution measured in mean absolute SHAP value). On the x-axis the SHAP values for each observation are presented-negative SHAP values are interpreted as reduced self-protecting behavior, while positive SHAP values are interpreted as increased level of self-protecting behaviors. Each dot represents an individual respondent; hence, the number of dots against each feature reflects the sample size of the training set. The dot's position along the x-axis is the feature's impact on the model's prediction for that respondent. When multiple dots arrive at the same coordinate in the plot, they pile up to show the density of effect sizes. The long-left tails in the summary plot indicate that the predictors are highly predictive for some respondents but not others, i.e., predictors with minor global importance can still be very important for specific respondents.
In Fig. 1 panel (b), we summarize our key findings for an easier global explanation of the impact of the features on the model and their association with self-protecting behaviors. The horizontal length of each bar shows the magnitude of impact on the model. The correlations of each feature with self-protecting behaviors are shown by the color code of each bar. We can see that features such as total number of deaths from COVID-19, stringency index (see Supplementary Appendix A for the variable definition), knowledge about COVID-19, being female, being married and co-habiting, having a pre-existing high blood pressure (HBP) and being smoker (including those who currently smoke or ex-smokers) have strong positive associations (correlation coefficient ≥ 0.9) with self-protecting behaviors, whereas being single, not experiencing COVID-19 induced work-related changes, not having pre-existing health conditions, not smoking, residing in Sweden, and working in essential public services all show strong negative associations (correlation coefficient ≤ − 0.9) with self-protecting behaviors. We also perform a robustness check of feature ranking using an alternative method to SHAP, feature importance (see Fig. C.1 in Supplementary Appendix C for feature ordering using permutation feature importance). Note that there is a big difference between these two alternative methods: permutation feature importance is based on the decrease in model performance whereas SHAP is based on the magnitude of feature attributions.
In the next subsections, we examine how each of the top 30 features contributes to the model's output. Since self-protecting behaviors are likely to depend on locally relevant perceptions of threats, susceptibility, and www.nature.com/scientificreports/ benefits in engaging in protective behavior, we structure our results according to factors related to epidemiological, socioeconomic, and specific health conditions. As most of the predictors of self-protecting behaviors are categorical, we create box plots to show the difference in the average marginal contribution of each category in a more robust way. The y-axis of the box plots shows the SHAP value of the variable, and on the x-axis are the values that the variable takes. We then systematically investigate interactions between features which have some policy relevance. In the end, we report heterogeneity analysis by country.
Epidemiological factors. Epidemiological models of the spread of infection diseases postulate that people start reacting against contracting a disease with self-protective measures whenever they are informed about the disease and when the burden of the disease is at a recognizable stage 41 . The variables typically used in epidemiological models include population density, stringency of lockdown measures, number of new infections and fatalities, confidence in institutions, and knowledge about the specific disease, COVID-19 in this case. Our results confirm that these indicators are all important predictors of self-protecting behavior. Figure 2 depicts the effects of local case and death rates and the national policy responses on self-protecting behaviors. Our results show that total number of cases and deaths linked to COVID-19, and the OxCGRT stringency index are ranked first, second, and third most important predictors of self-protecting behaviors respectively. A striking positive non-linear pattern exists between the stringency index and self-protecting behaviors. From panel (a) of Fig. 2, we see that a stringency index of 80 (with the index ranging between 0 and 100, 100 = most stringent response) is an important threshold over which any increase in strictness of policy response triggers a level of self-protecting behaviors higher than the mean prediction (positive SHAP values).
Panels (b, c) of Fig. 2 contain the partial effects of infection rate on individuals' behavior. Both measures of infection rate show a step like association with self-protecting behaviors. Behavioral responses remain quite unresponsive with respect to national infection rates for a while and jump higher following the total number of deaths from COVID-19 spiking up to above 20,000 from somewhere below 7000 deaths. From panel (d), we observe a somewhat non-linear negative association between the degree of population density and the level of www.nature.com/scientificreports/ self-protecting behaviors. Overall, living in a densely populated area/city/town/village is associated with lower self-protecting behaviors. Confidence in the essentials, government, and health services are the top 6 th , 7 th and 9 th most important features in our model, respectively. We find a consistent positive association between confidence in the essentials, government, health services, and the SHAP of each feature with correlation coefficients of 0.74, 0.77, and 0.84, respectively ( Fig. 3 panels (b)-(d)). We also identify an important level of confidence threshold (shown by the red dotted vertical line in the panels) over which the three features about confidence in major institutions are associated with positive SHAP values. We find higher threshold for confidence in essentials (degree of confidence = 6), compared to the other two features (degree of confidence = 5).
Panel (a) of Fig. 3 reveals a strong positive relationship between self-assessed level of respondent's knowledge about COVID-19 and level of self-protecting behaviors. It is the 4th most important predictor of self-protecting behaviors. We discover an important threshold (level of knowledge = 5) over which the level of knowledge is associated with positive SHAP values.
Interestingly, all features relating to confidence in authorities and knowledge about COVID-19 have long left tails shown in the summary plot, suggesting that the lower degree of trust and awareness about the pandemic is exceptionally detrimental to self-protecting behaviors as compared to a higher level of confidence is in enhancing protective behavior.  www.nature.com/scientificreports/ that is, individuals with an equivalent household income above 60% of the country's median) tend to report a higher level of engagement in protective behaviors against COVID-19. This suggests that adherence to some sanitary recommendations (e.g., physical distancing) is a costly option to relatively poor households. Hence, authorities should devise a buffering mechanism that could enable the financially vulnerable individuals to cope with the economic burden of pandemic health behaviors. We also find that access to open-air at home are among the strong predictor of self-protecting behaviors and they positively impact respondents' protective behaviors. We assume that better home features are indicators of higher wealth. RF prediction shows that respondents that live in homes without access to open-air have, on average, self-protecting behaviors that are well below the mean prediction. Having access to open-air with at least one home feature guarantees the level of self-protecting behaviors that is above the mean prediction. Of the different home features considered in our model, having a balcony appears the most important in terms of relative contribution to the model, whereas access to garden, terrace, and park do not feature in the top 30. The reason that balcony is among the main factor determining people's behaviors is that during COVID-19, it is through their balconies that people enjoy at least some sense of being outdoors while confined at home. Its functionality as a result has been diversified to be as a work corner, a place to exercise, a place of contemplation, and fresh air without the anxiety of breaking the lockdown rules.
Housing features, such as access to open-air and living in accommodation that are not overcrowded, enable individuals to comply with protective measures better. In addition, lack of access to outdoor spaces affects dwellers' mental health during the pandemic influencing decisions to engage in protective behaviors 42  Specific health conditions. The role of specific health conditions has been emphasized regarding the severity of COVID-19 once infected 6 . We also find that having at least one pre-existing health condition is strongly associated with higher self-protecting behavior, as shown in panel (a) of Fig. 5. Moreover, we assess how each pre-existing condition fares in terms of contributing to self-protecting behaviors prediction. As evident from panel (b), respondents with the pre-existing condition of high blood pressure tend to engage in a higher level of self-protecting behaviors than the others. Panels (c)-(e) of Fig. 5 display the impact of health and behavioral risk factors (described in "Data") on the overall model prediction. We find evidence that past alcohol consumption and obesity are negatively related to self-protecting behaviors. There is strong evidence that alcohol consumption is negatively associated with self-protecting behaviors. Obesity is associated with increased risk of severe illness from COVID-19 43,44 . At the same time, obesity is strongly related to low socioeconomic status 45 , that may explain the negative relationship between obesity and self-protecting behaviors.
Although smoking is an adverse health behavior, interestingly, our results show that smokers (including those who currently smoke or ex-smokers) tend to manifest a higher level of self-protecting behaviors than non-smokers. Hence, this increased risk of COVID-19 for smokers might motivate them to exhibit higher compliance. Consistent with Dai and colleagues 46 , who report that smokers tend to suffer severe outcome of COVID-19, whereas alcohol consumption was not linked to severe complication from the virus, we find that alcohol consumption deos not motivate greater self protection. Similarly, other study argue that smokers tend to have more severe symptoms of COVID-19 compared to non-smokers 47 .
At the beginning of the pandemic older people have been found to be disproportionately more likely to have severe symptoms from COVID-19 leading to hospitalization and death. We find that the nature of the relationship between age and self-protecting behaviors is somewhat non-linear, i.e., for age groups older than 50 years of age, we do not find a significant difference in the average level of self-protecting behaviors. Moreover, 40 years of age appears to be an essential threshold over which the number of years is associated with positive SHAP values.  Fig. 6, we examine the partial effect of individuals' level of education and their corresponding self-protecting behaviors. We do not find a monotonically increasing protective behavior with educational attainment. Our results show a non-linear association between education and self-protecting behaviors: the lowest level of selfprotection is associated with the lowest education group. It peaked with the middle education group and then slightly decreased for the highest education group. The increasing part (in the partial effect of education) reaffirms the strong association between socioeconomic status and self-protective behaviors. The slight drop in self-protection from middle to the highest education level could be explained by differential exposure to the pandemic for the high socioeconomic group. One such important difference is remote work. For instance, in our data, 36% of individuals with higher education levels responded that they were working at home during the first wave of the pandemic. This remote working arrangement is disproportionately higher for the more educated groups than the middle (16.5%) and lower (6%) education groups. Thus, it is plausible to argue that individuals with higher educational attainments have a low risk of getting infected with COVID-19 in the first place than those with low and middle educational attainment 48 . Our model identifies gender as the 5th top predictor of self-protecting behaviors. As evident from panel (b) of Fig. 6, there are substantial gender differences in health-protective behaviors. Females tend to exhibit a higher level of self-protective behavior. Gender differences in fear and risk perception can potentially explain this pattern 49 . The intra-household bargaining theory can also explain the residual effect-where women engage more in household caretaking roles while men continue to work in person [50][51][52][53] .
From panel (c) of Fig. 6, we see the partial effects of family status of individuals in determining the level of self-protecting behaviors in our model. Single respondents (ranked 8th) react less to the pandemic. This study also isolates a positive, partial impact of being married and cohabiting vs. married but living apart. Past work documented that loneliness is linked with lower engagement in COVID-19 protective behaviors, suggesting that individuals who live alone could find it more challenging to restrain themselves from meeting others as they lack companionship at home 54 .
In panel (d) of Fig. 6, we present the partial effect of country of residence on the level of self-protecting behaviors. The results show that compared to respondents from Italy, Spain, France, and Germany, respondents from Sweden tend to exhibit lower level of self-protecting behaviors during the first wave of the pandemic. In fact, Sweden was the only country never in lockdown. www.nature.com/scientificreports/

Measuring interaction effects.
In the preceding sections we highlighted the top-ranking features that are predictive of self-protecting behaviors. In some cases, it is plausible to assume that the relationship between the target variable and a feature depends on the value of another feature. Such contextual dependence between features that jointly impact predictions are known in the ML literature as feature interactions. Our RF model automatically accounts for these interactions, and we now systematically report interaction effects that have some policy relevance. The interaction between confidence in essentials with the number of deaths from the COVID-19 shows that the relationship between self-protecting behaviors and confidence in essentials is alerted by the national incidence rate of the pandemic, observed in Fig. 7 panel (a). Those individuals who are highly confident (level of confidence on the Likert scale six and above) that essential services will be maintained during the pandemic exhibit a higher level of self-protecting behaviors proportional to the number of deaths. On the other hand, those respondents with lower confidence in essentials respond asymmetrically to the pandemic's national incidence rate. We find a similar impact on confidence in health services as seen in Fig. 7 panel (b).
Next, we investigate the interaction effect between knowledge about COVID-19 and the relative poverty status of individuals: in Fig. 7 panel (c), we see a more pronounced positive effect of knowledge about the pandemic and self-protective behaviors among the poor group than the non-poor, meaning that among the poor those who show a higher level of self-protection are the ones who have better awareness about the pandemic. However, in the non-poor groups, the interaction effect is slim, and self-protection behaviors are not necessarily tied to the level of knowledge. This result is in line with the partial effect of education examined above. It is safe to assume that a high level of education and better knowledge about COVID-19 are highly correlated.
These interaction effects imply that enhancing an individual's trust in institutions could enable people to show behavioral responses with respect to the national incidence rate. Moreover, awareness creation about the pandemic that targets the low socioeconomic group could result in a better behavioral response.

Discussion
In response to the COVID-19 pandemic, several national governments have applied lockdown restrictions and requested their citizens to undertake some behavioral changes to reduce the infection rate. Even though it is apparent that human behavior largely influences the spread of illness, it is unclear what factors most strongly correlate with protective behavior. We examine the most relevant predictors of individual self-protective behavioral responses during the first wave of the COVID-19 pandemic employing a data-driven approach of ExpML on real-time survey data from five European countries. Employing state-of-the-art machine learning approaches, we show that COVID-19 disease progression and the responses that governments give through policies induce an increase in self-protecting behaviors provided that both the number of COVID-19 infection/death rate and the strictness of policies are sufficiently high. The government policy setting and the local infection rates play an important role in individuals' responses to the health crisis. These factors guide how individuals adopt the protective guidelines. Higher local infection rates increase people's perceived risk of getting infected, inducing responses to protective behaviors. This was further reflected in the macro features relating to the country of residence. Compared to respondents in Sweden, respondents from Italy, Spain, France, and Germany tended to exhibit a higher level of protecting behavior during the first wave of the pandemic. This can be explained by the stringent lockdown measures in these countries in April and early May 2020. The Swedish government promoted individual responsibility rather than mandatory restrictions during this period.
Our findings shed light on some social and economic features that policymakers could modify to bridge the socioeconomic gap that we identified in self-protecting behaviors. Consistent with economic models that predict suboptimal individual behavior in the presence of strong externalities where the costs of protective behavior are unevenly distributed across socio-demographic groups, we find that people with lower socioeconomic status are less likely to adhere to protective behaviors. First, we show that increased knowledge about COVID-19 and trust in government, health care, and essential services is crucial to adopting protective behavior. Second, we find that a lower level of self-protecting behaviors is associated with lower income. Third, we observe that suitable living arrangements and localities are positively related to health-protective behaviors. First of all, many complex ML algorithms might, on the one hand, be highly predictive but might, on the other hand, lack intrinsic model interpretability. To circumvent this issue, we employed ExpML that has the advantage of providing an opportunity to improve prediction accuracy over standard OLS using a 'black-box' supervised ML algorithm while still revealing interesting insights into a complex outcome. More specifically, we built a RF model and complement it with model-agnostic interpretative tools. Related to this first limitation, interpretability and relative contribution of each predictor are influenced by the crosssectional nature of the data preventing any causal conclusions on the basis on this data set. It is also important to keep in mind that we analyze the early wave of the pandemic (i.e., Spring 2020) and without a vaccine available. Both the fact that more knowledge on the pandemic emerged over time and that a vaccine became available might have had an impact on the factors that drive self-protecting behaviors. We also only report a limited number of interactions between features out of all possible binary combination among them and limited our analyses to those interactions that we assumed to have the highest levels of policy relevance. Finally, overfitting poses a problem in the current study. To this end, we applied a five-fold cross-validation and identified an optimal set of hyperparameters (i.e., maximum depths of trees and maximum number of features per split) to reduce the potential impact of overfitting.

Conclusions
Key findings of this paper shed light on essential policy variables that help to slow the current pandemic and guide the design of future policies with important implications for researchers and policy makers. Effective communication by authorities that restores trust, providing basic support to the financially disadvantaged, increasing access to open air for those residing in houses that lack basic outdoor features are some of the policy levers that government could pull to incentivize changes in protective behavior.